Overview of the Active Templates Program

The  goal  of  the  DARPA  Active  Templates  program  (2000-2004)  has  been  to  develop 
technologies  designed  to  revolutionize  aspects  of  mission  planning,  mission  execution,  and 
related  command  and  control  processes  in  order  to  improve  the  ability  of  military  units  to 
organize,  plan  and  conduct  military  operations.  A  particular  emphasis  has  been  placed  on 
demonstrating  the  application  of  these  technologies  to  problems  faced  in  the  course  of 
conducting  special  operations  and  other  time-critical,  small-unit  missions.  These  prototype 
technologies  have  focused  on  the  temporal  and  spatial  aspects  of  the  associated  mission 
information,  as  well  as  on  the  supporting  infrastructure  necessary  to  implement  them  in  a 
distributed, collaborative environment. Ordinary users must be able to dynamically define their
own  workflow  systems,  interface  to  disparate  data  sources,  and  specify  problems  to  be  solved  by 
advanced  problem-solvers. 

Technologies like electronic mail and the web offer significant advantages and have had a huge
impact on military capability, but technology beyond e-mail and the web promises even more.
The objective of this program was to create a new, structured way of communicating and
operating that enables the computer to build a partial model of the situation. With this model,
the computer will be able to assist users in:

•  finding  and  integrating  information 

•  prioritizing  requests  and  tasks 

•  determining  key  information  elements  and  available  options 

•  reusing similar past solutions or defaults

•  coordinating  decisions  and  routing  information  to  others 

•  recognizing  important  changes 

This  new  technology  will  provide  a  means,  using  tailored  templates,  for  doing  common  tasks  and 
sharing  data  in  a  dynamic  workflow  system.  These  templates  will  provide  prioritized 
information  that  updates  in  real  time  and  triggers  the  computer  to  automatically  analyze  the 
impact  of  changes,  suggest  default  actions,  auto-coordinate  decisions,  and  capture  a  digital 
history  for  purposes  of  accountability,  training,  and  process  improvement.  Most  importantly, 
when we sketch out plans (for example), we will have shareable data, and such data is necessary to
move beyond the spell-checking and keyword-search level of automation in place today.

Layered,  Multi-Template  Retrieval,  Adaptation  and  Learning 

Work  on  this  effort  was  based  extensively  on  previous  research  in  generative  planning  and 
learning,  case-based  and  mixed-initiative  plan  adaptation,  real-time  integration  of  action  and 
execution,  and  multi-agent  control  and  learning.  Users  performing  complex  planning  tasks 
should  be  able  to  rely  upon  fully-automated  agent-based  software  tools  -  while  being  able  to 
fully inspect, guide, and alter the software agents’ actions. Critical to this is a
framework for creating and managing template plans that can anticipate multiple contingencies
and dynamically re-plan based on real-time sensory information, arising both from the system’s own
execution and from unanticipated adversarial actions. Templates must be able to improve
incrementally, and new templates must be generated from past template-filling episodes.

Technical  Challenges 

Simplicity  of  use:  Users  must  be  able  to  examine  the  Special  Operations  Forces  (SOF) 
adaptation  of  templates  in  a  layered  approach  corresponding  to  increasingly  higher  levels  of 
detail. 

Incremental  and  dynamic  template  adaptation:  The  “template-filling”  and  adaptation  process 
must  be  approached  in  a  mixed-initiative,  incremental,  and  rationale-based  way  in  response  to 
dynamically  perceived  relevant  world  changes. 

Multi-template management: Retrieval, adaptation, and introspection of multiple templates are
necessary  to  allow  for  user-guided  identification  of  resource  contentions  and  other  conflicts. 
Based  on  template-filling  behaviors,  automated  suggestions  can  be  provided  to  assist  the  user  in 
merging  multiple  templates. 

Template  generation  and  learning:  Based  on  the  user’s  input  and  changes  during  dynamic 
execution of a plan, the system should be able to learn templates of increasing quality. Abstract
templates and partially instantiated templates can enable the generation of plan structures
involving multiple task forces and multiple decision points, as well as the data structures for
monitoring the execution of those plans.


Technical  Approach 

The  approach  taken  by  CMU  to  address  the  underlying  technical  challenges  associated  with 
making  Active  Templates  a  viable  Command  and  Control  concept  eventually  came  to  focus  on 
six themes: (1) allocation of communications spectrum frequencies, (2) extraction of plan
rationale and the learning of planning templates, (3) new abstraction techniques for reinforcement
learning to improve the efficiency of automatic control algorithms, (4) opponent modeling in
dynamic multi-agent environments, (5) multi-agent learning and limitations, and (6) planning
using symbolic model-based techniques. Each of these areas is described in the following
sections of this report. An extensive bibliography at the end of this report lists publications that
describe the results of these research tasks in more detail.

Communication  Frequency  Planner 

A frequency resource allocation system, CommPlanner, was developed for use in the Special
Forces domain. The first CommPlanner prototype determined which frequency bands are available
in an area of interest by generating a frequency channel usage list for that area. The
channel usage list is generated by parsing a given communication resources database for known
transmitters and receivers in the area of interest. Once the channel usage list has been generated,
CommPlanner finds and allocates free frequency bands that meet the specified requirements for
bandwidth, frequency range, and tolerance.
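The search for free bands described above can be illustrated with a minimal sketch. It is not the CommPlanner implementation: the representation of the channel usage list as (low, high) frequency pairs, the function names, and the greedy allocation order are all assumptions made for illustration, and the tolerance requirement is left out.

```python
from typing import List, Tuple

# Hypothetical representation: one (low, high) pair per occupied band.
Band = Tuple[float, float]


def free_bands(usage: List[Band], lo: float, hi: float) -> List[Band]:
    """Return the gaps inside [lo, hi] that no band in the usage list covers."""
    gaps = []
    cursor = lo  # lowest frequency not yet known to be occupied
    for start, end in sorted(usage):
        if end <= cursor:
            continue  # band lies entirely below the cursor
        if start >= hi:
            break  # band lies entirely above the range of interest
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < hi:
        gaps.append((cursor, hi))
    return gaps


def allocate(usage: List[Band], lo: float, hi: float,
             bandwidth: float, count: int) -> List[Band]:
    """Greedily carve up to `count` bands of width `bandwidth` from the gaps."""
    allocated = []
    for start, end in free_bands(usage, lo, hi):
        while end - start >= bandwidth and len(allocated) < count:
            allocated.append((start, start + bandwidth))
            start += bandwidth
    return allocated


# Illustrative usage list (made-up values, in MHz).
usage_list = [(30, 35), (40, 50)]
print(free_bands(usage_list, 30, 60))
print(allocate(usage_list, 30, 60, bandwidth=4, count=3))
```

A real planner would also have to respect the tolerance requirement and the geographic extent of each transmitter, which this sketch ignores.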

The  CommPlanner  prototype  operates  in  a  stand-alone  manner  with  a  graphical  interface  and  a 
nominal  database.  Operation  of  the  CommPlanner  prototype  was  successfully  demonstrated  to 
the  DARPA  Program  Manager,  SOF  DARPA  consultants,  and  to  operational  users  who  currently 
perform  this  tedious  task  by  hand.  Interaction  with  all  the  above  parties  led  to  a  number  of 
revisions to the prototype.

The revised CommPlanner program is a complete piece of software that can be integrated with
other SOF tools, most notably the C2PC program. CommPlanner takes as input XML specification
files that define how many frequency bands are required, their frequency range, and the
rectangular area in which they are needed (as defined by two pairs of longitude and latitude
coordinates). Using the information contained in the XML input file, CommPlanner then queries
the database to retrieve the relevant transmitter and receiver information that falls within the
specified frequency range and area of interest. The next step of the process, which is currently
under development, is to generate the frequency assignments using the previous code as a
starting point and to convert them to an appropriate XML output format. CommPlanner can also
communicate with real database files, e.g., FCCRegl.mdb, and can retrieve the record set from
the database using the information that originates from the XML file. The basic information
flow is thus for CommPlanner to fetch the specification from the input XML file, retrieve the
record set from the database using the fetched specification, perform the frequency band
calculation on the record set, and produce an output file in a valid XML format. CommPlanner
is self-contained, with a simple GUI through which the end user may interact.

Learning  Planning  Templates:  Domain-Specific  Planners 

One  of  the  key  issues  identified  in  template  learning  from  examples  is  the  ability  to  “understand” 
the  reasons  or  rationale  for  the  steps  in  a  given  execution  example. 

Several  tasks,  such  as  plan  reuse  and  agent  modeling,  need  to  interpret  a  given  or  observed  valid 
plan  to  generate  the  underlying  plan  rationale.  Although  there  have  been  several  successful 
contributions  to  this  rationale-extraction  problem,  they  do  not  apply  to  complex  plans,  in 
particular to plans with actions that have conditional effects. SPRAWL, an algorithm that finds a
minimally annotated, partially ordered structure in an observed totally ordered plan with
conditional effects, was developed. The algorithm proceeds in two phases. First, the
given plan is preprocessed using a novel “needs analysis” technique that builds a “needs tree” to
identify the dependencies that link all the literals, including the conditional effects, in the totally
ordered plan. This needs tree is then further processed to construct a partial ordering that
captures the complete rationale of the given plan.

This capability was then enhanced to support automatically acquiring templates from example
plans. As is well known, general-purpose planners can solve problems in a variety of domains
but can be quite inefficient in any given domain. Domain-specific planners, by contrast, are
more efficient but are difficult to create. Template-based planning has been introduced as a
novel paradigm for automatically generating domain-specific programs, or templates. DISTILL,
an algorithm for learning templates automatically from example plans, was thus developed.
DISTILL converts a given example plan into a template and then merges it with previously
learned templates. Empirical results in simple domains show that the
templates automatically learned by DISTILL compactly represent its domain-specific planning
experience. Furthermore, the templates situationally generalize the given example plan, thus
allowing them to efficiently solve problems that were not previously encountered.

This research is core to the Ph.D. Thesis of Elly Winner, expected to be finished in the fall of 2005,
entitled “Learning Domain-Specific Planners from Example Plans.”

Use  of  Macros  in  Control  Learning  for  Complex  Domains 

Methods for speeding up automatic control algorithms have been investigated, specifically new
abstraction techniques for Reinforcement Learning and Semi-Markov Decision Processes
(SMDPs). The use of policies as temporally abstract actions is introduced. This differs from
previous definitions of temporally abstract actions in that termination criteria are not involved. An
approach for processing previously solved problems to extract these policies has been developed,
along with a method for using supplied or extracted policies to guide and speed up the solving
of new problems. Policy extraction is treated as a supervised learning task, and a new
algorithm, LUMBERJACK, that extracts repeated sub-structure within a decision tree was
defined. Another algorithm, TTREE, that combines state and temporal abstraction to increase
problem-solving speed on new problems was also introduced. TTREE solves SMDPs by using
both user-supplied and machine-supplied policies as temporally abstract actions while generating
its own tree-based abstract state representation. By combining state and temporal abstraction in
this way, TTREE is the only known SMDP algorithm that is able to ignore irrelevant or harmful
sub-regions within a supplied abstract action while still making use of other parts of the abstract
action.

This research led to the Ph.D. Thesis of William Uther, finished in August 2002, entitled
“Tree-Based Hierarchical Reinforcement Learning.”

Opponent  Modeling  and  Coaching 

A set of algorithms has been developed that together can process observation logs of
continuous, dynamic multi-agent movements and actions in complex environments. The
algorithms analyze these logs and extract sequential templates, i.e., important repeated instances
of sequential behavior by the observed agents. In other words, given a set of recordings of the
movements and actions of multiple agents, the developed algorithms can extract repeating
patterns of sequential interaction. These algorithms have been fully implemented in the context
of  the  RoboCup  Soccer  domain. 

The  algorithms  work  in  two  stages:  The  first  stage  transforms  a  stream  of  continuous  world-state 
observations,  including  object  and  agent  positions,  into  a  stream  of  categorical  observations 
corresponding  to  recognized  atomic  single-agent  behaviors.  These  algorithms  rely  on  domain 
knowledge to identify instances of recognizable behavior, matching the continuous observation
stream  against  conditions  that  check  for  agent  movement  over  time,  relative  positions  of  agents 
and  objects,  and  spatial  patterns. 

Once  this  stream  of  recognized  behaviors  is  available,  a  second  algorithm  is  used  to  identify 
repeating  patterns.  First,  the  recognized  behavior  stream  is  segmented  by  team  ID  so  that  the 
behaviors  of  different  teams  are  separated.  The  segments  of  each  team  are  inserted  into  a 
separate trie structure, a tree-like data structure that supports efficient storage and access, while
maintaining a count of how many times each sequence and sub-sequence appeared. Finally, the
trie is traversed to assign a rank to each sequence by means of a ranking function. The sequences
can then be sorted by rank for the final output.
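The counting and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the implemented system: sub-sequences are taken to be contiguous and bounded by a hypothetical max_len, the behavior names are made up, and the ranking function (count times length) is only a placeholder, since the report does not specify the one actually used.

```python
from collections import defaultdict


class TrieNode:
    """Trie node holding how often the sequence ending here was seen."""

    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.count = 0


def insert_segment(root, behaviors, max_len):
    """Insert every contiguous sub-sequence of up to max_len behaviors."""
    for i in range(len(behaviors)):
        node = root
        for behavior in behaviors[i:i + max_len]:
            node = node.children[behavior]
            node.count += 1


def ranked_sequences(root, rank):
    """Traverse the trie, rank each stored sequence, and sort best first."""
    results = []
    stack = [((), root)]
    while stack:
        seq, node = stack.pop()
        for behavior, child in node.children.items():
            child_seq = seq + (behavior,)
            results.append((rank(child_seq, child.count), child_seq))
            stack.append((child_seq, child))
    return sorted(results, reverse=True)


# One trie per team; a single made-up segment for a single team is shown.
team_trie = TrieNode()
insert_segment(team_trie, ["pass", "dribble", "pass", "dribble", "shoot"], max_len=3)

# Placeholder ranking: favor sequences that are both frequent and long.
ranking = ranked_sequences(team_trie, rank=lambda seq, count: count * len(seq))
print(ranking[:3])
```

Keeping the count on each node means that a prefix shared by many sequences is stored once, which is the storage advantage the trie offers over a flat table of sequences.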

The  problem  of  modeling  observed  execution  has  also  been  extensively  investigated.  An 
advising  agent,  a  coach,  provides  advice  to  other  agents  about  how  to  act.  This  is  accomplished 
by  an  advice  generation  method  using  observations  of  agents  acting  in  an  environment.  Given  an 
abstract  state  definition  and  partially  specified  abstract  actions,  the  algorithm  extracts  a  Markov 
Chain, infers a Markov Decision Process (MDP), and then solves the MDP (given an arbitrary
reward signal) to generate advice. This capability has been evaluated in a simulated robot-soccer
environment, and experimental results show improved agent performance when using the advice
generated from the MDP for both a sub-task and the full soccer game.
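The first step of this pipeline, extracting a Markov Chain from observations, can be illustrated with a short sketch. It assumes that each observed game has already been mapped to a sequence of abstract states and that transition probabilities are estimated by simple frequency counting; both are assumptions for illustration, as are the state names. The inference of the MDP and its solution are not shown.

```python
from collections import Counter, defaultdict


def estimate_markov_chain(traces):
    """Estimate transition probabilities from abstract-state sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each observed step from one abstract state to the next.
        for state, next_state in zip(trace, trace[1:]):
            counts[state][next_state] += 1
    chain = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        chain[state] = {s: n / total for s, n in successors.items()}
    return chain


# Made-up abstract-state traces from three observed episodes.
observed = [
    ["midfield", "attack", "goal"],
    ["midfield", "attack", "midfield"],
    ["midfield", "defense"],
]
chain = estimate_markov_chain(observed)
print(chain["midfield"])
print(chain["attack"])
```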

This research is core to the Ph.D. Thesis of Patrick Riley, expected to be finished in the fall of
2005, entitled “Advice Generation from Modeling Observed Execution.”

Multi-Agent  Control  Learning 

Learning  to  act  in  a  multi-agent  environment  is  a  challenging  problem.  Optimal  behavior  for  one 
agent  depends  upon  the  behavior  of  the  other  agents,  which  are  learning  as  well.  Multi-agent 
environments  are  therefore  non-stationary  -  violating  the  traditional  assumption  underlying 
single-agent  learning.  In  addition,  agents  in  complex  tasks  may  have  limitations,  such  as 
physical  constraints  or  designer-imposed  approximations  of  the  task  that  make  learning  tractable. 
Limitations  prevent  agents  from  acting  optimally,  which  complicates  the  already  challenging 
problem.  A  learning  agent  must  effectively  compensate  for  its  own  limitations  while  exploiting 
the  limitations  of  the  other  agents.  This  research  theme  focuses  on  these  two  challenges,  namely 
multi-agent  learning  and  limitations,  and  includes  four  main  contributions. 

First, the thesis introduces the novel concepts of a variable learning rate and the WoLF (Win or
Learn Fast) principle to account for other learning agents. The WoLF principle is capable of
making rational learning algorithms converge to optimal policies. By doing so it achieves two
properties, rationality and convergence, which have not been achieved by previous techniques.
The converging effect of WoLF has been proven for a class of matrix games and demonstrated
empirically for a wide range of stochastic games.
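The idea of a variable learning rate can be illustrated with a small sketch. The report names the principle but gives no update rule, so the details here are assumptions: the agent is treated as "winning" when its current policy has a higher expected value than its average policy, following the published WoLF policy hill-climbing formulation as best understood here, and the step sizes are illustrative values only.

```python
def wolf_step_size(policy, avg_policy, q_values, delta_win=0.01, delta_lose=0.04):
    """Pick a small step when winning and a larger step when losing."""
    current_value = sum(p * q for p, q in zip(policy, q_values))
    average_value = sum(p * q for p, q in zip(avg_policy, q_values))
    # "Winning" is assumed to mean the current policy beats the average policy.
    return delta_win if current_value > average_value else delta_lose


def hill_climb(policy, q_values, delta):
    """Shift probability toward the greedy action, staying a valid distribution."""
    n = len(policy)
    best = max(range(n), key=lambda a: q_values[a])
    new_policy = list(policy)
    for a in range(n):
        if a != best:
            # Never take more probability from an action than it holds.
            shift = min(new_policy[a], delta / (n - 1))
            new_policy[a] -= shift
            new_policy[best] += shift
    return new_policy


# Made-up values for one state of a two-action game.
q = [1.0, 0.0]
policy, avg_policy = [0.2, 0.8], [0.5, 0.5]
delta = wolf_step_size(policy, avg_policy, q)  # losing, so the larger step is chosen
print(delta, hill_climb(policy, q, delta))
```

The design point is the asymmetry: a learner that adapts slowly while it is doing well and quickly while it is doing badly gives the other learning agents time to settle.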

Second, the thesis contributes an analysis of the effect of limitations on the game-theoretic
concept  of  Nash  Equilibria.  The  existence  of  equilibria  is  important  if  multi-agent  learning 
techniques,  which  often  depend  on  the  concept,  are  to  be  applied  to  realistic  problems  where 
limitations  are  unavoidable.  The  thesis  introduces  a  general  model  for  the  effect  of  limitations 
on  agent  behavior,  which  is  used  to  analyze  the  resulting  impact  on  equilibria.  The  thesis  shows 
that equilibria do exist for a few restricted classes of games and limitations, but even
well-behaved limitations do not preserve the existence of equilibria in general.

Third, the thesis introduces GraWoLF, a general-purpose, scalable, multi-agent learning
algorithm.  GraWoLF  combines  policy  gradient  learning  techniques  with  the  WoLF  variable 
learning  rate.  The  effectiveness  of  the  learning  algorithm  has  been  demonstrated  in  both  a  card 
game  with  an  intractably  large  state  space,  and  an  adversarial  robot  task.  These  two  tasks  are 
complex  and  agent  limitations  are  prevalent  in  both. 

Fourth,  the  thesis  describes  the  CMDragons  robot  soccer  team  strategy  for  adapting  to  an 
unknown  opponent.  The  strategy  uses  a  notion  of  plays  as  coordinated  team  plans.  The 
selection  of  team  plans  is  the  decision  point  for  adapting  the  team  to  its  current  opponent,  based 
on  the  outcome  of  previously  executed  plays.  The  CMDragons  were  the  first  RoboCup  robot 
soccer  team  to  employ  online  learning  to  autonomously  alter  its  behavior  during  the  course  of  a 
game. 

These four contributions demonstrate that it is possible to effectively learn to act in the presence
of  other  learning  agents  in  complex  domains  when  agents  may  have  limitations.  The  introduced 
learning  techniques  are  proven  effective  in  a  class  of  small  games,  and  demonstrated  empirically 
across  a  wide  range  of  settings  that  increase  in  complexity. 

This research led to the Ph.D. Thesis of Michael Bowling, finished in August 2003, entitled
“Multi-Agent Learning in the Presence of Agents with Limitations.”

Efficient  Planning  using  Symbolic  Model-Based  Techniques 

Automated  planning  considers  selecting  and  sequencing  actions  in  order  to  change  the  state  of  a 
discrete  system  from  some  initial  state  to  some  goal  state.  This  problem  is  fundamental  in  a  wide 
range  of  industrial  and  academic  fields  including  robotics,  automation,  embedded  systems,  and 
operational  research.  Planning  with  non-deterministic  actions  can  be  used  to  model  dynamic 
environments  and  alternative  action  behavior.  One  of  the  currently  best  known  approaches  is  to 
employ  reduced  ordered  Binary  Decision  Diagrams  (BDDs)  to  represent  and  generate  plans 
using  techniques  developed  in  symbolic  model  checking.  However,  the  approach  is  challenged 
by  a  frequent  blow-up  of  the  BDDs  representing  the  search  frontier  and  a  limited  number  of 
solution  classes. 

This thesis addresses both of these problems. With respect to the first, it contributes a general
framework called state-set branching that seamlessly combines classical heuristic search and
BDD-based search. Our experimental results show that the performance of state-set branching
often  dominates  both  blind  BDD-based  search  and  ordinary  heuristic  search.  In  addition,  it 
consistently  outperforms  any  previous  approach  to  guide  a  BDD-based  search  of  which  we  are 
aware.  We  show  that  state-set  branching  naturally  generalizes  to  non-deterministic  planning  and 
introduce  heuristically  guided  versions  of  the  current  BDD-based  non-deterministic  planning 
algorithms. 

With  respect  to  the  second  problem,  the  thesis  introduces  two  frameworks  called  fault  tolerant 
planning and adversarial planning. Fault tolerant planning addresses domains where
non-determinism is caused by rare errors. The current solution classes handle this situation poorly,
either by taking all fault combinations into account or by producing solutions that are too weak. The thesis
contributes  a  new  class  of  solutions  called  fault  tolerant  plans  that  are  robust  to  a  limited  number 
of  faults.  In  addition,  it  introduces  specialized  BDD-based  algorithms  for  synthesizing  fault 
tolerant  plans. 

Adversarial planning considers situations where non-determinism is caused by uncontrollable,
but  known,  environment  actions.  The  current  solution  classes  of  BDD-based  non-deterministic 
planning  assume  a  “friendly”  environment  and  may  never  reach  a  goal  state  if  the  environment  is 
hostile  and  informed.  The  thesis  contributes  efficient  BDD-based  algorithms  for  synthesizing 
winning  strategies  for  such  problems. 

This research led to the Ph.D. Thesis of Rune Jensen, finished in August 2003, entitled “Efficient
BDD-Based Planning for Non-Deterministic, Fault-Tolerant, and Adversarial Domains.”